Team member:
Trần Minh Quân 20203233
Lê Bá Đức 20203208
Mai Đức Huy 20203217
Giang Thị Thanh Huyền 20223489
Trương Thị Thu Thảo 20223527
The data set includes data from a direct marketer who sells his products only via direct mail. He sends catalogs with product characteristics to customers who then order directly from the catalogs. The marketer has developed customer records to learn what makes some customers spend more than others.
The objective of this predictive modeling exercise is to predict the amount that will be spent in terms of the provided customer characteristics for a direct marketer who sells his products via mail. This analysis will be useful for the marketer to make strategic decisions about advertising and targetting a selected group of potential customers based on the amount that they are predicted to spend in the future.
The dataset DirectMarketing.csv contains 1000 records
and 18 attributes. But only the first 10 columns are significant, 8
remaining columns are just one-hot encoded attributes from the first 10
row. As one-hot encoding is only used for linear regression part, so we
decided to remove 8 last rows in Exploratory Data Analysis part for
better speed.
0. Introduction
1. Importing
Libraries
2. Data
3.
Exploratory Data Analysis
- 3.1 Missing Value
- 3.2 Correlation Matrix
-
3.3 Describe Function
4. Data Analysis
- 4.1 Age
- 4.2 Gender
- 4.3 Own Home
- 4.4 Married
- 4.5
Location
- 4.6 Children
-
4.7 History
- 4.8 Catalogs
- 4.9 Amount Spent
- 4.10
Salary
# Install Tidyverse for data manipulation and visualization
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install MASS for statistical functions
install.packages("MASS")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install glmnet for Lasso and Ridge regression
install.packages("glmnet")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install boot for bootstrapping methods
install.packages("boot")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install locfit for local regression
install.packages("locfit")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install caret for machine learning modeling
install.packages("caret")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install corrplot for correlation plot visualization
install.packages("corrplot")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install ggplot2 for advanced data visualization
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install reshape2 for data reshaping
install.packages("reshape2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install dplyr for data manipulation
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install plotly for interactive plots
install.packages("plotly")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
# Install gridExtra for customizing plot layouts
install.packages("gridExtra")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages("Hmisc")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(boot)
library(locfit)
## locfit 1.5-9.8 2023-06-11
##
## Attaching package: 'locfit'
##
## The following object is masked from 'package:purrr':
##
## none
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
##
## The following object is masked from 'package:boot':
##
## melanoma
##
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(corrplot)
## corrplot 0.92 loaded
library(glmnet)
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:MASS':
##
## select
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
Load the data and look at its dimansions.
d_dimar <- read.csv('DirectMarketing.csv')
dim(d_dimar)
## [1] 1000 18
Look at the first few rows to get to know the contents.
head(d_dimar)
## Age Gender OwnHome Married Location Salary Children History Catalogs
## 1 Old Female Own Single Far 47500 0 High 6
## 2 Middle Male Rent Single Close 63600 0 High 6
## 3 Young Female Rent Single Close 13500 0 Low 18
## 4 Middle Male Own Married Close 85600 1 High 18
## 5 Middle Female Own Single Close 68400 0 High 12
## 6 Young Male Own Married Close 30400 0 Low 6
## AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
## 1 755 1 0 0 1 0 0 0 1
## 2 1318 0 0 1 0 0 1 0 1
## 3 296 1 0 1 0 1 0 0 0
## 4 2436 0 1 1 1 0 1 0 1
## 5 1304 1 0 1 1 0 1 0 1
## 6 495 0 1 1 1 1 0 0 0
Now we gonna take only first 10 columns for data analysis
d_mar <- d_dimar[, 1:10]
dim(d_mar)
## [1] 1000 10
head(d_mar)
## Age Gender OwnHome Married Location Salary Children History Catalogs
## 1 Old Female Own Single Far 47500 0 High 6
## 2 Middle Male Rent Single Close 63600 0 High 6
## 3 Young Female Rent Single Close 13500 0 Low 18
## 4 Middle Male Own Married Close 85600 1 High 18
## 5 Middle Female Own Single Close 68400 0 High 12
## 6 Young Male Own Married Close 30400 0 Low 6
## AmountSpent
## 1 755
## 2 1318
## 3 296
## 4 2436
## 5 1304
## 6 495
We have 3 numerical features and 6 categorical features. Now we move to data analysis part.
# 3.Exploratory Data Analysis * Exploratory Data
Analysis refers to the critical process of performing initial
investigations on data so as to discover patterns,to spot anomalies, to
test hypothesis and to check assumptions with the help of summary
statistics and graphical representations.
Your goal during EDA is to develop an understanding of your data.
The easiest way to do this is to use questions as tools to guide your
investigation. When you ask a question, the question focuses your
attention on a specific part of your dataset and helps you decide which
graphs, models, or transformations to make.
Generate questions about your data.
Search for answers by visualising, transforming, and modelling
your data.
Use what you learn to refine your questions and/or generate new
questions.
EDA is not a formal process with a strict set of
rules.
More than anything, EDA is a state of mind.
During the initial phases of EDA you should feel free to
investigate every idea that occurs to you.
Some of these ideas will pan out, and some will be dead
ends.
As your exploration continues, you will home in on a few
particularly productive areas that you’ll eventually write up and
communicate to others.
duplicates <- d_mar[duplicated(d_mar), ]
duplicates
## [1] Age Gender OwnHome Married Location Salary
## [7] Children History Catalogs AmountSpent
## <0 rows> (or 0-length row.names)
We don’t have duplicated data
any(is.na(d_mar))
## [1] TRUE
colSums(is.na(d_mar))
## Age Gender OwnHome Married Location Salary
## 0 0 0 0 0 0
## Children History Catalogs AmountSpent
## 0 303 0 0
In this step, we add another level to the History columns and replace the missing values with ‘Never’ to represent the customers who have not yet purchased and print out the levels.
levs <- levels(d_mar$History)
levs[length(levs)+1] <- "Never"
d_mar$History <- factor(d_mar$History, levels=levs)
d_mar$History[is.na(d_mar$History)] <- "Never"
d_mar$History <- ordered(d_mar$History, levels=c("Never", "Low", "Medium", "High"))
print('After replacing missing values NA with Never:')
## [1] "After replacing missing values NA with Never:"
levels(d_mar$History)
## [1] "Never" "Low" "Medium" "High"
Look at the distribution of customers according to History
table(d_mar$History)
##
## Never Low Medium High
## 1000 0 0 0
numeric_columns <- sapply(d_dimar, is.numeric)
corr_matrix <- cor(d_dimar[, numeric_columns], use = "complete.obs")
corr_matrix
## Salary Children Catalogs AmountSpent Gender_b
## Salary 1.00000000 0.049663163 0.18355086 0.6995957 -0.261492181
## Children 0.04966316 1.000000000 -0.11345543 -0.2223082 0.105469083
## Catalogs 0.18355086 -0.113455428 1.00000000 0.4726499 -0.087350767
## AmountSpent 0.69959571 -0.222308170 0.47264989 1.0000000 -0.201690213
## Gender_b -0.26149218 0.105469083 -0.08735077 -0.2016902 1.000000000
## Married_b 0.67563308 0.009770249 0.13705989 0.4758800 -0.116057285
## Location_b 0.03712709 -0.002391455 -0.12858075 -0.2526157 -0.005553971
## Ownhome_b 0.46073640 -0.032274083 0.09313151 0.3508080 -0.084433317
## Age_y -0.58857078 0.073527118 -0.15887218 -0.4346918 0.113978982
## Age_m 0.52905162 0.244719648 0.11408342 0.3013953 -0.204232847
## Hist_m -0.01237083 -0.041939180 0.01604679 -0.1438307 -0.025799164
## Hist_h 0.52469002 -0.273361609 0.28493154 0.5903957 -0.160750813
## Married_b Location_b Ownhome_b Age_y Age_m
## Salary 0.675633080 0.037127094 0.460736395 -0.58857078 0.52905162
## Children 0.009770249 -0.002391455 -0.032274083 0.07352712 0.24471965
## Catalogs 0.137059886 -0.128580754 0.093131508 -0.15887218 0.11408342
## AmountSpent 0.475879979 -0.252615659 0.350807999 -0.43469185 0.30139535
## Gender_b -0.116057285 -0.005553971 -0.084433317 0.11397898 -0.20423285
## Married_b 1.000000000 0.006964058 0.264009318 -0.28328892 0.15595721
## Location_b 0.006964058 1.000000000 0.033691291 -0.03298183 0.04108406
## Ownhome_b 0.264009318 0.033691291 1.000000000 -0.46929874 0.25164907
## Age_y -0.283288923 -0.032981833 -0.469298744 1.00000000 -0.64468192
## Age_m 0.155957211 0.041084058 0.251649074 -0.64468192 1.00000000
## Hist_m 0.027285078 0.088858299 -0.001919144 -0.13977934 0.09447173
## Hist_h 0.353280416 -0.207556702 0.277386271 -0.33568013 0.17191158
## Hist_m Hist_h
## Salary -0.012370830 0.5246900
## Children -0.041939180 -0.2733616
## Catalogs 0.016046789 0.2849315
## AmountSpent -0.143830654 0.5903957
## Gender_b -0.025799164 -0.1607508
## Married_b 0.027285078 0.3532804
## Location_b 0.088858299 -0.2075567
## Ownhome_b -0.001919144 0.2773863
## Age_y -0.139779338 -0.3356801
## Age_m 0.094471729 0.1719116
## Hist_m 1.000000000 -0.3034567
## Hist_h -0.303456731 1.0000000
# Melt the correlation matrix for ggplot
melted_corr_matrix <- melt(corr_matrix)
# Create the heatmap plot and assign it to a variable
heatmap_plot <- ggplot(data = melted_corr_matrix, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
axis.text.y = element_text(size = 12)) +
labs(x = '', y = '', title = 'Correlation Matrix') +
geom_text(aes(label = sprintf("%.2f", value)), vjust = 1, size = 3)
# Print the plot with larger dimensions
print(heatmap_plot)
# Save the plot with larger dimensions
ggsave("heatmap_plot.png", plot = heatmap_plot, width = 10, height = 8, dpi = 300)
summary(d_mar)
## Age Gender OwnHome Married
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Location Salary Children History
## Length:1000 Min. : 10100 Min. :0.000 Never :1000
## Class :character 1st Qu.: 29975 1st Qu.:0.000 Low : 0
## Mode :character Median : 53700 Median :1.000 Medium: 0
## Mean : 56104 Mean :0.934 High : 0
## 3rd Qu.: 77025 3rd Qu.:2.000
## Max. :168800 Max. :3.000
## Catalogs AmountSpent
## Min. : 6.00 Min. : 38.0
## 1st Qu.: 6.00 1st Qu.: 488.2
## Median :12.00 Median : 962.0
## Mean :14.68 Mean :1216.8
## 3rd Qu.:18.00 3rd Qu.:1688.5
## Max. :24.00 Max. :6217.0
Take a look at the structure of our data frame, looks good for a linear regression model.
str(d_mar)
## 'data.frame': 1000 obs. of 10 variables:
## $ Age : chr "Old" "Middle" "Young" "Middle" ...
## $ Gender : chr "Female" "Male" "Female" "Male" ...
## $ OwnHome : chr "Own" "Rent" "Rent" "Own" ...
## $ Married : chr "Single" "Single" "Single" "Married" ...
## $ Location : chr "Far" "Close" "Close" "Close" ...
## $ Salary : int 47500 63600 13500 85600 68400 30400 48100 68400 51900 80700 ...
## $ Children : int 0 0 0 1 0 0 0 0 3 0 ...
## $ History : Ord.factor w/ 4 levels "Never"<"Low"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Catalogs : int 6 6 18 18 12 6 12 18 6 18 ...
## $ AmountSpent: int 755 1318 296 2436 1304 495 782 1155 158 3034 ...
d_mar_Age <- d_mar %>%
count(Age) %>%
rename(count = n)
d_mar_Age
## Age count
## 1 Middle 508
## 2 Old 205
## 3 Young 287
fig <- plot_ly(d_mar_Age, labels = ~Age, values = ~count, type = 'pie', hole = 0.4) %>%
layout(title = 'Age Distribution', xaxis = list(title = 'Age'), yaxis = list(title = 'Count'))
fig
fig <- plot_ly(d_mar_Age, x = ~Age, y = ~count, type = 'bar', marker = list(colorscale = 'Viridis')) %>%
layout(title = 'Age Distribution', xaxis = list(title = 'Age'), yaxis = list(title = 'Count'))
fig
d_mar_Age_Salary <- d_mar %>% group_by(Age) %>% summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))
d_mar_Age_AmountSpent <- d_mar %>% group_by(Age) %>% summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
result <- merge(d_mar_Age_Salary, d_mar_Age_AmountSpent, by = "Age")
result
## Age AVG_Salary AVG_AmountSpent
## 1 Middle 72036.42 1501.69
## 2 Old 56365.85 1432.13
## 3 Young 27715.68 558.62
fig <- subplot(
plot_ly(result, x = ~Age, y = ~AVG_Salary, type = 'bar', name = 'Mean Salary', marker = list(colorscale = 'fall')),
plot_ly(result, x = ~Age, y = ~AVG_AmountSpent, type = 'bar', name = 'Mean Amount Spent', marker = list(colorscale = 'fall')),
nrows = 2
) %>% layout(title = 'Age', xaxis = list(title = 'Age'), yaxis = list(title = 'Value'))
fig
# Plot 1: Mean Salary
p1 <- ggplot(result, aes(x = Age, y = AVG_Salary, fill = AVG_Salary)) +
geom_bar(stat = "identity") +
scale_fill_gradientn(colours = rainbow(4)) +
labs(title = "Age AVG Salary", x = "Age", y = "AVG Salary") +
theme_minimal()
# Plot 2: Mean Amount Spent
p2 <- ggplot(result, aes(x = Age, y = AVG_AmountSpent, fill = AVG_AmountSpent)) +
geom_bar(stat = "identity") +
scale_fill_gradientn(colours = rainbow(4)) +
labs(title = "Age AVG Amount Spent", x = "Age", y = "AVG Amount Spent") +
theme_minimal()
# Combine plots
grid.arrange(p1, p2, nrow = 2)
# Scatter plot with trendline
p <- ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = as.factor(Age))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_color_brewer(palette = "Set1") +
labs(title = 'Age With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent', color = 'Age')
print(p)
## `geom_smooth()` using formula = 'y ~ x'
d_mar_Gender <- d_mar %>%
count(Gender) %>%
rename(count = n)
d_mar_Gender
## Gender count
## 1 Female 506
## 2 Male 494
fig <- plot_ly(d_mar_Gender, x = ~Gender, y = ~count, type = 'bar', marker = list(color = ~count, colorscale = 'Viridis'), text = ~count, textposition = 'outside') %>%
layout(title = 'Gender Distribution', xaxis = list(title = 'Gender'), yaxis = list(title = 'Count'))
fig
# Calculate average salary and amount spent by gender
d_mar_Gender_Salary <- d_mar %>%
group_by(Gender) %>%
summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))
d_mar_Gender_AmountSpent <- d_mar %>%
group_by(Gender) %>%
summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
# Combine the two data frames
result <- merge(d_mar_Gender_Salary, d_mar_Gender_AmountSpent, by = "Gender")
# Create individual plots
plot1 <- ggplot(result, aes(x = Gender, y = AVG_Salary, fill = Gender)) +
geom_bar(stat = "identity") +
labs(title = "Gender AVG Salary", x = "Gender", y = "Average Salary") +
theme_minimal()
plot2 <- ggplot(result, aes(x = Gender, y = AVG_AmountSpent, fill = Gender)) +
geom_bar(stat = "identity") +
labs(title = "Gender AVG Amount Spent", x = "Gender", y = "Average Amount Spent") +
theme_minimal()
# Combine plots into a single figure with subplots
grid.arrange(plot1, plot2, nrow = 2)
# Scatter plot with trendline
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = Gender)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + # Linear model trendline without confidence interval
scale_color_brewer(palette = "Set1") + # Color by Gender
labs(title = 'Gender With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
d_mar_G_and_A <- d_mar %>%
count(Gender, Age) %>%
rename(count = n)
d_mar_G_and_A
## Gender Age count
## 1 Female Middle 206
## 2 Female Old 129
## 3 Female Young 171
## 4 Male Middle 302
## 5 Male Old 76
## 6 Male Young 116
d_mar_G_and_A <- d_mar %>%
group_by(Gender, Age) %>%
summarise(count = n())
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
ggplot(d_mar_G_and_A, aes(x = Age, y = count, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Age Count With Gender", x = "Age", y = "Count") +
theme_minimal()
d_mar_G_and_A <- d_mar %>%
group_by(Gender, Age) %>%
summarise(AVG_AmountSpent = mean(AmountSpent, na.rm = TRUE)) %>%
mutate(AVG_AmountSpent = round(AVG_AmountSpent, 2))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
d_mar_G_and_A
## # A tibble: 6 × 3
## # Groups: Gender [2]
## Gender Age AVG_AmountSpent
## <chr> <chr> <dbl>
## 1 Female Middle 1301.
## 2 Female Old 1279.
## 3 Female Young 501.
## 4 Male Middle 1638.
## 5 Male Old 1692.
## 6 Male Young 643.
Highest average spending:Male Old
1691
Lowest average spending: Female
Young 501
# Calculating various statistics by Gender and Age
d_mar_G_and_A_AVG <- d_mar %>%
group_by(Gender, Age) %>%
summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
d_mar_G_and_A_Max <- d_mar %>%
group_by(Gender, Age) %>%
summarise(Max_AmountSpent = max(AmountSpent, na.rm = TRUE))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
d_mar_G_and_A_Min <- d_mar %>%
group_by(Gender, Age) %>%
summarise(Min_AmountSpent = min(AmountSpent, na.rm = TRUE))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
d_mar_G_and_A_Count <- d_mar %>%
group_by(Gender, Age) %>%
summarise(Count = n())
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
# Merging the data frames
result <- reduce(list(d_mar_G_and_A_AVG, d_mar_G_and_A_Max, d_mar_G_and_A_Min, d_mar_G_and_A_Count), full_join, by = c("Gender", "Age"))
result
## # A tibble: 6 × 6
## # Groups: Gender [2]
## Gender Age AVG_AmountSpent Max_AmountSpent Min_AmountSpent Count
## <chr> <chr> <dbl> <int> <int> <int>
## 1 Female Middle 1301. 5830 158 206
## 2 Female Old 1279. 5564 65 129
## 3 Female Young 501. 3688 47 171
## 4 Male Middle 1638. 5878 157 302
## 5 Male Old 1692. 6217 297 76
## 6 Male Young 643. 1692 38 116
# Reshape the data from wide to long format for faceting
long_result <- result %>%
gather(key = "Statistic", value = "Value", AVG_AmountSpent, Min_AmountSpent, Max_AmountSpent, Count)
# Create the plot
ggplot(long_result, aes(x = paste(Gender, Age), y = Value, fill = Statistic)) +
geom_bar(stat = "identity", position = position_dodge()) +
facet_wrap(~ Statistic, scales = "free_y", ncol = 1) +
theme_minimal() +
labs(title = "Gender Age With Amount Spent", x = "Gender and Age", y = "Value") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Gender distribution is balanced
Men earn more and spend more
The highest number of customers is middle age men
Lowest number of customers older men
Highest average spending:Male Old 1691
Lowest average spending: Female Young 501
Highest average Salary:Male Middle 76.3 k
Lowest average Salary: Female Young 25.5 k
d_mar_OwnHome <- d_mar %>%
count(OwnHome) %>%
rename(count = n)
d_mar_OwnHome
## OwnHome count
## 1 Own 516
## 2 Rent 484
# Assuming d_mar_OwnHome is already created and contains 'OwnHome' and 'count' columns
fig <- plot_ly(d_mar_OwnHome, labels = ~OwnHome, values = ~count, type = 'pie', marker = list(colors = c('darkblue', 'darkcyan'))) %>%
layout(title = 'Own Home Count')
fig
OwnHome distribution is balanced
What are the customers’ average income and expenses by OwnHome?
# Calculate average salary and amount spent by OwnHome
d_mar_OwnHome_Salary <- d_mar %>%
group_by(OwnHome) %>%
summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))
d_mar_OwnHome_AmountSpent <- d_mar %>%
group_by(OwnHome) %>%
summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
# Combine the data
result <- merge(d_mar_OwnHome_Salary, d_mar_OwnHome_AmountSpent, by = "OwnHome")
# Plotting using ggplot2 and gridExtra
plot1 <- ggplot(result, aes(x = OwnHome, y = AVG_Salary, fill = OwnHome)) +
geom_bar(stat = "identity") +
labs(title = "Own Home AVG Salary", x = "Own Home", y = "Average Salary") +
theme_minimal()
plot2 <- ggplot(result, aes(x = OwnHome, y = AVG_AmountSpent, fill = OwnHome)) +
geom_bar(stat = "identity") +
labs(title = "Own Home AVG Amount Spent", x = "Own Home", y = "Average Amount Spent") +
theme_minimal()
library(gridExtra)
grid.arrange(plot1, plot2, nrow = 2)
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = OwnHome)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + # Linear model trendline
scale_color_brewer(palette = "Set1") + # Color by OwnHome
labs(title = 'Own Home With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
d_mar_Married <- d_mar %>%
count(Married) %>%
rename(count = n)
fig <- plot_ly(d_mar_Married, labels = ~Married, values = ~count, type = 'pie', marker = list(colors = c('darkblue', 'darkcyan'))) %>%
layout(title = 'Married Count')
fig
# Calculating average salary and amount spent by Married status
d_mar_Married_Salary <- d_mar %>%
group_by(Married) %>%
summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))
d_mar_Married_AmountSpent <- d_mar %>%
group_by(Married) %>%
summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
# Combine the data
result <- merge(d_mar_Married_Salary, d_mar_Married_AmountSpent, by = "Married")
# Plotting using ggplot2 and gridExtra
plot1 <- ggplot(result, aes(x = Married, y = AVG_Salary, fill = Married)) +
geom_bar(stat = "identity") +
labs(title = "Married AVG Salary", x = "Married", y = "Average Salary") +
theme_minimal()
plot2 <- ggplot(result, aes(x = Married, y = AVG_AmountSpent, fill = Married)) +
geom_bar(stat = "identity") +
labs(title = "Married AVG Amount Spent", x = "Married", y = "Average Amount Spent") +
theme_minimal()
grid.arrange(plot1, plot2, nrow = 2)
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = Married)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + # Linear model trendline
scale_color_brewer(palette = "Set1") + # Color by Married status
labs(title = 'Married With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
How is the Location distribution ?
d_mar_Location <- d_mar %>%
count(Location) %>%
rename(count = n)
ggplot(d_mar_Location, aes(x = Location, y = count, fill = count)) +
geom_bar(stat = "identity") +
scale_fill_viridis_c() +
labs(title = "Location Distribution", x = "Location", y = "Count") +
theme_minimal() +
geom_text(aes(label = count), vjust = -0.5)
# Boxplot for Salary by Location
plot_salary <- ggplot(d_mar, aes(x = Location, y = Salary)) +
geom_boxplot() +
labs(title = "Salary Distribution by Location", x = "Location", y = "Salary") +
theme_minimal()
# Boxplot for AmountSpent by Location
plot_amount_spent <- ggplot(d_mar, aes(x = Location, y = AmountSpent)) +
geom_boxplot() +
labs(title = "Amount Spent Distribution by Location", x = "Location", y = "Amount Spent") +
theme_minimal()
# Arrange the plots
grid.arrange(plot_salary, plot_amount_spent, nrow = 2)
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = Location)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + # Linear model trendline
scale_color_brewer(palette = "Set1") + # Color by Location
labs(title = 'Location With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
d_mar_Children <- d_mar %>%
count(Children) %>%
rename(count = n)
d_mar_Children
## Children count
## 1 0 462
## 2 1 267
## 3 2 146
## 4 3 125
fig <- plot_ly(d_mar_Children, labels = ~Children, values = ~count, type = 'pie', marker = list(colors = c('darkblue', 'darkcyan', 'CadetBlue', 'DarkSeaGreen'))) %>%
layout(title = 'Children Count')
fig
fig <- plot_ly(d_mar_Children, x = ~Children, y = ~count, type = 'scatter', mode = 'markers', marker = list(color = ~count, size = ~count * 0.1, showscale = TRUE)) %>%
layout(title = 'Children Distribution', xaxis = list(title = 'Children Count'), yaxis = list(title = 'Number Of Customers'))
fig
46 percent of customers don’t have Children
What are the customers’ average income and expenses by Children?
# Calculating average salary and amount spent by Children
d_mar_Children_Salary <- d_mar %>%
group_by(Children) %>%
summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))
d_mar_Children_AmountSpent <- d_mar %>%
group_by(Children) %>%
summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
# Combine the data
result <- merge(d_mar_Children_Salary, d_mar_Children_AmountSpent, by = "Children")
# Plotting using ggplot2 and gridExtra
plot_salary <- ggplot(result, aes(x = Children, y = AVG_Salary)) +
geom_bar(stat = "identity", aes(fill = Children)) +
labs(title = "Children AVG Salary", x = "Number of Children", y = "Average Salary") +
theme_minimal()
plot_amount_spent <- ggplot(result, aes(x = Children, y = AVG_AmountSpent)) +
geom_bar(stat = "identity", aes(fill = Children)) +
labs(title = "Children AVG Amount Spent", x = "Number of Children", y = "Average Amount Spent") +
theme_minimal()
# Arrange the plots
grid.arrange(plot_salary, plot_amount_spent, nrow = 2)
# Scatter plot with trendline for Salary vs Amount Spent by Children
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = as.factor(Children))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + # Linear model trendline
scale_color_brewer(palette = "Set1") + # Color by number of Children
labs(title = 'Children With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Calculate average salary and amount spent by History
d_mar_History_Salary <- d_mar %>%
group_by(History) %>%
summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))
d_mar_History_AmountSpent <- d_mar %>%
group_by(History) %>%
summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
# Combine the data
result <- merge(d_mar_History_Salary, d_mar_History_AmountSpent, by = "History")
# Plotting using ggplot2 and gridExtra
plot_salary <- ggplot(result, aes(x = History, y = AVG_Salary, fill = History)) +
geom_bar(stat = "identity") +
labs(title = "History AVG Salary", x = "History", y = "Average Salary") +
theme_minimal()
plot_amount_spent <- ggplot(result, aes(x = History, y = AVG_AmountSpent, fill = History)) +
geom_bar(stat = "identity") +
labs(title = "History AVG Amount Spent", x = "History", y = "Average Amount Spent") +
theme_minimal()
grid.arrange(plot_salary, plot_amount_spent, nrow = 2)
# Scatter plot with trendline for Salary vs Amount Spent by History
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = History)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + # Linear model trendline
scale_color_brewer(palette = "Set1") + # Color by History
labs(title = 'History With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
d_mar_Catalogs <- d_mar %>%
count(Catalogs) %>%
rename(count = n)
d_mar_Catalogs
## Catalogs count
## 1 6 252
## 2 12 282
## 3 18 233
## 4 24 233
fig <- plot_ly(d_mar_Catalogs, labels = ~Catalogs, values = ~count, type = 'pie', marker = list(colors = c('darkblue', 'darkcyan', 'CadetBlue', 'DarkSeaGreen'))) %>%
layout(title = 'Catalogs Count')
fig
# Calculating average salary and amount spent by Catalogs
d_mar_Catalogs_Salary <- d_mar %>%
group_by(Catalogs) %>%
summarise(AVG_Salary = round(mean(Salary, na.rm = TRUE), 2))
d_mar_Catalogs_AmountSpent <- d_mar %>%
group_by(Catalogs) %>%
summarise(AVG_AmountSpent = round(mean(AmountSpent, na.rm = TRUE), 2))
# Combine the data
result <- merge(d_mar_Catalogs_Salary, d_mar_Catalogs_AmountSpent, by = "Catalogs")
# Plotting using ggplot2 and gridExtra
library(ggplot2)
library(gridExtra)
plot_salary <- ggplot(result, aes(x = Catalogs, y = AVG_Salary)) +
geom_bar(stat = "identity", aes(fill = Catalogs)) +
labs(title = "Catalogs AVG Salary", x = "Catalogs", y = "Average Salary") +
theme_minimal()
plot_amount_spent <- ggplot(result, aes(x = Catalogs, y = AVG_AmountSpent)) +
geom_bar(stat = "identity", aes(fill = Catalogs)) +
labs(title = "Catalogs AVG Amount Spent", x = "Catalogs", y = "Average Amount Spent") +
theme_minimal()
grid.arrange(plot_salary, plot_amount_spent, nrow = 2)
ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = as.factor(Catalogs))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) + # Linear model trendline
scale_color_brewer(palette = "Set1") + # Color by Catalogs
labs(title = 'Catalogs With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(d_mar, aes(x = AmountSpent)) +
geom_histogram(binwidth = 200, fill = "red", color = "black") +
scale_x_continuous(limits = c(0, 6000), breaks = seq(0, 6000, by = 200)) +
labs(title = "Distribution Of Amount Spent", x = "Amount Spent", y = "Counts") +
theme_minimal()
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
ggplot(d_mar, aes(x = "", y = AmountSpent)) +
geom_boxplot(fill = "royalblue", color = "black") +
stat_summary(fun.data = "mean_sdl", geom = "errorbar", color = "red", width = 0.5) +
labs(title = "Amount Spent Distribution", y = "Amount Spent") +
theme_minimal()
ggplot(d_mar, aes(x = factor(1), y = AmountSpent)) +
geom_violin(fill = "lightseagreen", color = "black", alpha = 0.6) +
geom_boxplot(width = 0.1, fill = "white", color = "black") +
labs(title = "Amount Spent Distribution", y = "Amount Spent") +
theme_minimal() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank())
ggplot(d_mar, aes(x = Salary)) +
geom_histogram(binwidth = 5000, fill = "red", color = "black") +
scale_x_continuous(limits = c(0, 150000), breaks = seq(0, 150000, by = 5000)) +
labs(title = "Distribution Of Salary", x = "Salary", y = "Counts") +
theme_minimal()
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
ggplot(d_mar, aes(x = factor(1), y = Salary)) +
geom_boxplot(fill = "royalblue", color = "black") +
stat_summary(fun.data = "mean_sdl", geom = "errorbar", color = "red", width = 0.5) +
labs(title = "Salary Distribution") +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplot(d_mar, aes(x = factor(1), y = Salary)) +
geom_violin(fill = "lightseagreen", color = "black", alpha = 0.6) +
geom_boxplot(width = 0.1, fill = "white", color = "black") +
labs(title = "Salary Distribution") +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Data has only object and integer values.
Dataset comprises of 1000 observations and 10 characteristics.
We don’t have duplicated data
We have already said in the data set that null values are customers who have not exchanged with us in the past, we do not have missing data.
Most of the customers are in the middle age group
Middle age group earns more and spends more
Gender distribution is balanced
Men earn more and spend more
The highest number of customers is middle age men
Lowest number of customers older men
Highest average spending:Male Old 1691
Lowest average spending: Female Young 501
Highest average Salary:Male Middle 76.3 k
Lowest average Salary: Female Young 25.5 k
OwnHome distribution is balanced
Homeowners earn more and spend more
Married distribution is balanced
Married people earn more and spend more
Most customers are close to the nearest physical store that sells similar products
Customers who are close to the nearest physical store selling similar products have lower spend , although their income is higher
46 percent of customers don’t have Children
Although customers’ incomes were close, there was a decrease in spending as the number of children increased
The highest number of customers who prefer us for the first time
Customers with high previous purchasing volume are the group with the highest income and expenditure
Catalogs distribution is balanced
As the number of catalogs sent increases, so does the expenditure and income
Amount Spent max:6217
Amount Spent mean:1216
Amount Spent median:962
Amount Spent min:38
Salary max:168.8 k
Salary mean:56.1 k
Salary median:53.7 k
Salary min:10.1 k
In this part, we will come back to d_dimar dataframe
containing 18 columns, and the last 8 columns is used for linear
regression. Firstly, we will remind you about d_dimar
dataframe:
head(d_dimar)
## Age Gender OwnHome Married Location Salary Children History Catalogs
## 1 Old Female Own Single Far 47500 0 High 6
## 2 Middle Male Rent Single Close 63600 0 High 6
## 3 Young Female Rent Single Close 13500 0 Low 18
## 4 Middle Male Own Married Close 85600 1 High 18
## 5 Middle Female Own Single Close 68400 0 High 12
## 6 Young Male Own Married Close 30400 0 Low 6
## AmountSpent Gender_b Married_b Location_b Ownhome_b Age_y Age_m Hist_m Hist_h
## 1 755 1 0 0 1 0 0 0 1
## 2 1318 0 0 1 0 0 1 0 1
## 3 296 1 0 1 0 1 0 0 0
## 4 2436 0 1 1 1 0 1 0 1
## 5 1304 1 0 1 1 0 1 0 1
## 6 495 0 1 1 1 1 0 0 0
Now, let’s see again the correlation matrix between quantitative attributes:
# Melt the correlation matrix for ggplot
melted_corr_matrix <- melt(corr_matrix)
# Create the heatmap plot and assign it to a variable
heatmap_plot <- ggplot(data = melted_corr_matrix, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
axis.text.y = element_text(size = 12)) +
labs(x = '', y = '', title = 'Correlation Matrix') +
geom_text(aes(label = sprintf("%.2f", value)), vjust = 1, size = 3)
# Print the plot with larger dimensions
print(heatmap_plot)
# Save the plot with larger dimensions
ggsave("heatmap_plot.png", plot = heatmap_plot, width = 10, height = 8, dpi = 300)
par(mfrow=c(2,1))
plot(density(d_mar$AmountSpent), main="Density-AmountSpent", xlab="Amount Spent")
plot(density(d_mar$Salary), main="Density-Salary", xlab="Salary")
We would expect a linear relation in Amount spent and Salary of customers. Let us see if it is in fact true.
Now consider the familiar model:
\(AmountSpent = \beta_0 + \beta_1 * Salary\)
fit <- lm(AmountSpent ~ Salary, data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ Salary, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2179.7 -315.2 -53.5 279.7 3752.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.31783 45.37416 -0.338 0.736
## Salary 0.02196 0.00071 30.930 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 687.1 on 998 degrees of freedom
## Multiple R-squared: 0.4894, Adjusted R-squared: 0.4889
## F-statistic: 956.7 on 1 and 998 DF, p-value: < 2.2e-16
The intercept could be removed since it is not significant from the test
fit <- lm(AmountSpent ~ 0 + Salary, data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ 0 + Salary, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2159.5 -322.0 -60.8 277.5 3761.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Salary 0.0217504 0.0003398 64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 686.8 on 999 degrees of freedom
## Multiple R-squared: 0.8039, Adjusted R-squared: 0.8038
## F-statistic: 4097 on 1 and 999 DF, p-value: < 2.2e-16
The second model, which forces the line through the origin (no intercept), suggests a stronger relationship between Salary and AmountSpent, evidenced by a higher R-squared value (0.8039 vs. 0.4894). This could indicate that the true relationship may indeed pass through the origin, or it may be an artifact of this particular dataset. However, excluding the intercept can sometimes lead to misleading interpretations, and it’s crucial to consider whether it makes theoretical sense for the intercept to be zero in the context of the data.
# Scatter plot with trendline
p <- ggplot(d_mar, aes(x = Salary, y = AmountSpent, color = as.factor(Age))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_color_brewer(palette = "Set1") +
labs(title = 'Age With Salary Vs Amount Spent', x = 'Salary', y = 'Amount Spent', color = 'Age')
print(p)
## `geom_smooth()` using formula = 'y ~ x'
fit <- lm(AmountSpent ~ Children, data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ Children, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1300.6 -669.7 -253.4 422.4 4810.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1406.63 39.67 35.460 < 2e-16 ***
## Children -203.27 28.22 -7.203 1.16e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 937.5 on 998 degrees of freedom
## Multiple R-squared: 0.04942, Adjusted R-squared: 0.04847
## F-statistic: 51.89 on 1 and 998 DF, p-value: 1.157e-12
fit <- lm(AmountSpent ~ 0 + Salary, data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ 0 + Salary, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2159.5 -322.0 -60.8 277.5 3761.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Salary 0.0217504 0.0003398 64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 686.8 on 999 degrees of freedom
## Multiple R-squared: 0.8039, Adjusted R-squared: 0.8038
## F-statistic: 4097 on 1 and 999 DF, p-value: < 2.2e-16
Model 1, predicting Amount Spent based on the number of Children, has a low R-squared, indicating that Children alone poorly predict spending.
Model 2, using Salary without an intercept, shows a much higher R-squared, suggesting Salary is a strong predictor of spending.
The significant negative coefficient for Children in Model 1 indicates that as the number of children increases, the amount spent decreases. However, the explanatory power of Salary on Amount Spent is far greater than that of the number of Children, as seen in the difference in R-squared values (0.8039 vs. 0.04942).
First, we will plot the density of amount spent based on the customer history.
ggplot(d_mar, aes(x=AmountSpent)) + geom_density(aes(group=History, fill=History), alpha=.3)
We can observe and infer a couple of things here. Firstly the obvious - customers with a High History tend to spend more and those with a low history tend to spend low. Secondly, the ones whoch we thought did not have a history also have a record of spending some amount. Thus, our assumption is incorrect. However, we will keep Never as a category and build models on it since we do not have any detailed information about the predictor. In a practical scenario, we’d go back and get more information about how this data as classified while storing it.
Next, we’ll look at the amount spend based on the age of customers.
fit <- lm(AmountSpent ~ Hist_m+Hist_h, data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ Hist_m + Hist_h, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1621.1 -501.3 -219.2 315.9 4125.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 858.95 33.60 25.566 <2e-16 ***
## Hist_m 91.45 62.98 1.452 0.147
## Hist_h 1327.19 59.06 22.472 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 775.6 on 997 degrees of freedom
## Multiple R-squared: 0.3499, Adjusted R-squared: 0.3486
## F-statistic: 268.4 on 2 and 997 DF, p-value: < 2.2e-16
fit <- lm(AmountSpent ~0+ Hist_m+Hist_h, data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ 0 + Hist_m + Hist_h, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1621.1 -59.2 316.5 747.4 4984.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Hist_m 950.40 68.51 13.87 <2e-16 ***
## Hist_h 2186.14 62.47 35.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 997.5 on 998 degrees of freedom
## Multiple R-squared: 0.5868, Adjusted R-squared: 0.586
## F-statistic: 708.6 on 2 and 998 DF, p-value: < 2.2e-16
Model 1, with an intercept, explains 35% of the variance in Amount Spent using customer history categories (medium and high), but the medium history variable is not a significant predictor. Model 2, without an intercept, attributes all of the Amount Spent variance to history categories and explains a higher variance (58.68%), with both history categories as significant predictors. The higher R-squared in Model 2 suggests that when we force the regression through the origin, the history categories alone account for more of the variance in spending, indicating their strong individual contributions to the model. However, caution is needed as omitting the intercept can lead to misestimation of effects.
ggplot(d_mar, aes(x=AmountSpent)) + geom_density(aes(group=Age, fill=Age), alpha=.3)
fit <- lm(AmountSpent ~ Age_m+Age_y, data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ Age_m + Age_y, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1367.1 -546.1 -152.1 390.6 4784.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1432.13 60.48 23.678 <2e-16 ***
## Age_m 69.56 71.65 0.971 0.332
## Age_y -873.50 79.19 -11.030 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 866 on 997 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1881
## F-statistic: 116.7 on 2 and 997 DF, p-value: < 2.2e-16
fit <- lm(AmountSpent ~ 0+ Age_m+Age_y, data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ 0 + Age_m + Age_y, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1344.7 -368.9 55.9 780.1 6217.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Age_m 1501.69 48.00 31.285 <2e-16 ***
## Age_y 558.62 63.86 8.747 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1082 on 998 degrees of freedom
## Multiple R-squared: 0.5139, Adjusted R-squared: 0.513
## F-statistic: 527.6 on 2 and 998 DF, p-value: < 2.2e-16
Model 1 indicates that the age category ‘young’ significantly predicts Amount Spent negatively, while ‘middle’ age has no significant effect. The model explains 18.97% of the variance in spending. Model 2, without an intercept, finds both age categories to be significant predictors, with ‘middle’ age having a positive association with spending. This model accounts for a higher variance in spending (51.39%). The absence of an intercept suggests that Age alone is believed to explain all variations in spending, which may not be realistic, but it indicates the strong influence of age categories on spending behavior.
The next plot shows the distribution of salaries of customers based on gender and age.
ggplot(data=d_mar, aes(x=Age, y=mean(Salary))) + geom_col() + facet_grid(vars(OwnHome), vars(Gender)) + ylab("Avg. Salary")
fit <- lm(AmountSpent ~ Age_m+Age_y + Salary + Ownhome_b + Gender_b , data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ Age_m + Age_y + Salary + Ownhome_b +
## Gender_b, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2437.7 -340.8 -49.6 277.6 3494.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.220e+02 8.234e+01 2.696 0.00713 **
## Age_m -2.879e+02 5.896e+01 -4.884 1.21e-06 ***
## Age_y -2.340e+02 7.098e+01 -3.296 0.00102 **
## Salary 2.195e-02 9.537e-04 23.016 < 2e-16 ***
## Ownhome_b 2.227e+01 5.093e+01 0.437 0.66201
## Gender_b -6.897e+01 4.489e+01 -1.536 0.12475
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 679.1 on 994 degrees of freedom
## Multiple R-squared: 0.5033, Adjusted R-squared: 0.5008
## F-statistic: 201.4 on 5 and 994 DF, p-value: < 2.2e-16
fit <- lm(AmountSpent ~ 0+ Age_m+Age_y + Salary + Ownhome_b + Gender_b , data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ 0 + Age_m + Age_y + Salary + Ownhome_b +
## Gender_b, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2510.3 -297.4 -29.8 293.9 3507.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Age_m -2.224e+02 5.389e+01 -4.127 3.98e-05 ***
## Age_y -9.366e+01 4.843e+01 -1.934 0.0534 .
## Salary 2.348e-02 7.682e-04 30.570 < 2e-16 ***
## Ownhome_b 5.815e+01 4.932e+01 1.179 0.2386
## Gender_b -1.193e+01 3.972e+01 -0.300 0.7639
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 681.2 on 995 degrees of freedom
## Multiple R-squared: 0.8079, Adjusted R-squared: 0.8069
## F-statistic: 836.8 on 5 and 995 DF, p-value: < 2.2e-16
Model 1, which includes an intercept, reveals that Age (middle and young) and Salary significantly influence Amount Spent. Owning a home and gender (binary) are not significant predictors. This model explains about 50.33% of the variance in spending. Model 2, without an intercept, suggests a higher explanatory power (80.79%) with all variables directly influencing Amount Spent. However, only Age (middle) and Salary remain significant predictors. The substantial increase in R-squared in Model 2 might be due to the absence of an intercept, forcing the model to attribute all variance to the included predictors, which can lead to overestimation of their effects.
fit <- lm(AmountSpent ~ Married_b+Location_b+Age_m+Age_y + Salary + Ownhome_b + Gender_b + Hist_m + Hist_h + Hist_m , data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ Married_b + Location_b + Age_m + Age_y +
## Salary + Ownhome_b + Gender_b + Hist_m + Hist_h + Hist_m,
## data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2215.05 -335.47 -36.96 255.35 3059.31
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.065e+02 8.151e+01 7.440 2.18e-13 ***
## Married_b -2.117e+01 5.492e+01 -0.386 0.699938
## Location_b -4.849e+02 4.346e+01 -11.156 < 2e-16 ***
## Age_m -1.929e+02 5.510e+01 -3.500 0.000485 ***
## Age_y -1.966e+02 6.465e+01 -3.042 0.002416 **
## Salary 1.886e-02 1.231e-03 15.316 < 2e-16 ***
## Ownhome_b 2.093e+01 4.511e+01 0.464 0.642732
## Gender_b -5.180e+01 3.988e+01 -1.299 0.194256
## Hist_m -1.451e+02 5.101e+01 -2.845 0.004538 **
## Hist_h 4.223e+02 5.796e+01 7.286 6.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 599.7 on 990 degrees of freedom
## Multiple R-squared: 0.6142, Adjusted R-squared: 0.6107
## F-statistic: 175.1 on 9 and 990 DF, p-value: < 2.2e-16
fit <- lm(AmountSpent ~ 0+ Married_b+Location_b+Age_m+Age_y + Salary + Ownhome_b + Gender_b + Hist_m + Hist_h + Hist_m , data = d_dimar)
summary(fit)
##
## Call:
## lm(formula = AmountSpent ~ 0 + Married_b + Location_b + Age_m +
## Age_y + Salary + Ownhome_b + Gender_b + Hist_m + Hist_h +
## Hist_m, data = d_dimar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2441.63 -302.17 22.27 302.10 3152.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Married_b -7.275e+01 5.596e+01 -1.300 0.19385
## Location_b -3.712e+02 4.179e+01 -8.882 < 2e-16 ***
## Age_m -5.594e+01 5.334e+01 -1.049 0.29454
## Age_y 1.403e+02 4.738e+01 2.962 0.00313 **
## Salary 2.254e-02 1.158e-03 19.470 < 2e-16 ***
## Ownhome_b 9.737e+01 4.511e+01 2.158 0.03114 *
## Gender_b 8.014e+01 3.668e+01 2.185 0.02914 *
## Hist_m -4.537e+01 5.055e+01 -0.898 0.36958
## Hist_h 4.949e+02 5.867e+01 8.435 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 615.9 on 991 degrees of freedom
## Multiple R-squared: 0.8436, Adjusted R-squared: 0.8422
## F-statistic: 593.8 on 9 and 991 DF, p-value: < 2.2e-16
Model 1, with an intercept, shows that Location, Age, and Salary are significant predictors of Amount Spent, with Location having a notably negative effect. The model explains 61.42% of the variance in spending. In contrast, Model 2, omitting the intercept, attributes more variance (84.36%) to the included predictors and identifies additional significant variables like Ownhome and Gender. The increase in R-squared in Model 2 suggests that when all variance is attributed to the predictors, their effects are overestimated. Model 1’s approach is more conservative and potentially more realistic, acknowledging other unaccounted factors influencing spending.
History attributes.